Abstract

The quality of data representation in deep learning methods is directly related to
the prior model imposed on the representations; however, generally used fixed
priors are not capable of adjusting to the context in the data. To address this issue,
we propose deep predictive coding networks, a hierarchical generative model that
empirically alters priors on the latent representations in a dynamic and
context-sensitive manner. This model captures the temporal dependencies in time-varying
signals and uses top-down information to modulate the representation in lower
layers. The centerpiece of our model is a novel procedure to infer sparse states of a
dynamic network which is used for feature extraction. We also extend this feature
extraction block to introduce a pooling function that captures locally invariant
representations. When applied to natural video data, we show that our method
is able to learn high-level visual features. We also demonstrate the role of the
top-down connections by showing the robustness of the proposed model to structured
noise.

1 Introduction

The performance of machine learning algorithms is dependent on how the data is represented. In
most methods, the quality of a data representation is itself dependent on prior knowledge imposed on
the representation. Such prior knowledge can be imposed using domain specific information, as in
SIFT [1], HOG [2], etc., or in learning representations using fixed priors like sparsity [3], temporal
coherence [4], etc. The use of fixed priors became particularly popular while training deep networks
[5-8]. In spite of the success of these general purpose priors, they are not capable of adjusting to
the context in the data. On the other hand, there are several advantages to having a model that can
"actively" adapt to the context in the data. One way of achieving this is to empirically alter the
priors in a dynamic and context-sensitive manner. This will be the main focus of this work, with
emphasis on visual perception.

Here we propose a predictive coding framework, where a deep locally-connected generative model
uses "top-down" information to empirically alter the priors used in the lower layers to perform
"bottom-up" inference. The centerpiece of the proposed model is extracting sparse features from
time-varying observations using a linear dynamical model. To this end, we propose a novel
procedure to infer sparse states (or features) of a dynamical system. We then extend this feature extraction
block to introduce a pooling strategy to learn invariant feature representations from the data. In line
with other "deep learning" methods, we use these basic building blocks to construct a hierarchical
model using greedy layer-wise unsupervised learning. The hierarchical model is built such that the
output from one layer acts as an input to the layer above. In other words, the layers are arranged in a
Markov chain such that the states at any layer are only dependent on the representations in the layer
below and above, and are independent of the rest of the model. The overall goal of the dynamical
system at any layer is to make the best prediction of the representation in the layer below using the
top-down information from the layers above and the temporal information from the previous states.
Hence the name deep predictive coding networks (DPCN).

1.1 Related Work 



The DPCN proposed here is closely related to models proposed in [9, 10], where predictive
coding is used as a statistical model to explain cortical functions in the mammalian brain. Similar to
the proposed model, they construct hierarchical generative models that seek to infer the underlying
causes of the sensory inputs. While Rao and Ballard [9] use an update rule similar to the Kalman filter
for inference, Friston [10] proposed a general framework considering all the higher-order moments
in a continuous time dynamic model. However, neither of the models is capable of extracting
discriminative information, namely a sparse and invariant representation, from an image sequence that
is helpful for high-level tasks like object recognition. Unlike these models, here we propose an
efficient inference procedure to extract locally invariant representations from image sequences and
progressively extract more abstract information at higher levels in the model.

Other methods used for building deep models, like the restricted Boltzmann machine (RBM) [11],
auto-encoders [8, 12] and predictive sparse decomposition [13], are also related to the model proposed
here. All these models are constructed on similar underlying principles: (1) like ours, they also use
greedy layer-wise unsupervised learning to construct a hierarchical model and (2) each layer consists
of an encoder and a decoder. The key to these models is to learn both encoding and decoding
concurrently (with some regularization like sparsity [13], denoising [8] or weight sharing [11]),
while building the deep network as a feed-forward model using only the encoder. The idea is
to approximate the latent representation using only the feed-forward encoder, while avoiding the
decoder, which typically requires a more expensive inference procedure. However, in DPCN there is
no encoder. Instead, DPCN relies on an efficient inference procedure to get a more accurate latent
representation. As we will show below, the use of reciprocal top-down and bottom-up connections
makes the proposed model more robust to structured noise during recognition and also allows it to
perform low-level tasks like image denoising.

To scale to large images, several convolutional models are also proposed in a similar deep learning 
paradigm [5-7]. Inference in these models is applied over an entire image, rather than small parts of 
the input. DPCN can also be extended to form a convolutional network, but this will not be discussed 
here. 



2 Model 

In this section, we begin with a brief description of the general predictive coding framework and
proceed to discuss the details of the architecture used in this work. The basic block of the proposed
model that is pervasive across all layers is a generalized state-space model of the form:

$$\begin{aligned} \tilde{y}_t &= \mathcal{F}(x_t) + n_t \\ x_t &= \mathcal{G}(x_{t-1}, u_t) + v_t \end{aligned} \tag{1}$$

where $\tilde{y}_t$ is the data and $\mathcal{F}$ and $\mathcal{G}$ are some functions that can be parameterized, say by $\theta$. The terms
$u_t$ are called the unknown causes. Since we are usually interested in obtaining abstract information
from the observations, the causes are encouraged to have a non-linear relationship with the
observations. The hidden states, $x_t$, then "mediate the influence of the cause on the output and endow
the system with memory" [10]. The terms $v_t$ and $n_t$ are stochastic and model uncertainty. Several
such state-space models can now be stacked, with the output from one acting as an input to the layer
above, to form a hierarchy. Such an L-layered hierarchical model at any time $t$ can be described
as¹:

$$\begin{aligned} u_t^{(l-1)} &= \mathcal{F}(x_t^{(l)}) + n_t^{(l)} \qquad \forall l \in \{1, 2, ..., L\} \\ x_t^{(l)} &= \mathcal{G}(x_{t-1}^{(l)}, u_t^{(l)}) + v_t^{(l)} \end{aligned} \tag{2}$$

The terms $v_t^{(l)}$ and $n_t^{(l)}$ form stochastic fluctuations at the higher layers and enter each layer
independently. In other words, this model forms a Markov chain across the layers, simplifying the
inference procedure. Notice how the causes at the lower layer form the "observations" to the layer
above; the causes form the link between the layers, and the states link the dynamics over time.
The important point in this design is that the higher-level predictions influence the lower levels'
inference. The predictions from a higher layer non-linearly enter into the state space model by
empirically altering the prior on the causes. In summary, the top-down connections and the temporal
dependencies in the state space influence the latent representation at any layer.

¹When $l = 1$, i.e., at the bottom layer, $u_t^{(l-1)} = y_t$, where $y_t$ is the input data.

Figure 1: (a) Shows a single layered network on a group of small overlapping patches of the input
video. The green bubbles indicate a group of inputs ($y_t^{(n)}$, $\forall n$), red bubbles indicate their
corresponding states ($x_t^{(n)}$) and the blue bubbles indicate the causes ($u_t$) that pool all the states within
the group. (b) Shows a two-layered hierarchical model constructed by stacking several such basic
blocks. For visualization no overlapping is shown between the image patches here, but overlapping
patches are considered during actual implementation. Panel titles: (a) a single layered dynamic
network depicting a basic computational block; (b) the distributive hierarchical model formed by
stacking several basic blocks.


In the following sections, we will first describe a basic computational network, as in (1), with a
particular form of the functions $\mathcal{F}$ and $\mathcal{G}$. Specifically, we will consider a linear dynamical model
with sparse states for encoding the inputs and the state transitions, followed by the non-linear pooling
function to infer the causes. Next, we will discuss how to stack and learn a hierarchical model using
several of these basic networks. We will also discuss how to incorporate the top-down information
during inference in the hierarchical model.



2.1 Dynamic network 

To begin with, we consider a dynamic network to extract features from a small part of a video
sequence. Let $\{y_1, y_2, ..., y_t, ...\} \in \mathbb{R}^P$ be a P-dimensional sequence of a patch extracted from
the same location across all the frames in a video². To process this, our network consists of two
distinctive parts (see Figure 1a): feature extraction (inferring states) and pooling (inferring causes).
For the first part, sparse coding is used in conjunction with a linear state space model to map the
inputs $y_t$ at time $t$ onto an over-complete dictionary of $K$ filters, $C \in \mathbb{R}^{P \times K}$ ($K > P$), to get
sparse states $x_t \in \mathbb{R}^K$. To keep track of the dynamics in the latent states we use a linear function
with state-transition matrix $A \in \mathbb{R}^{K \times K}$. More formally, inference of the features $x_t$ is performed
by finding a representation that minimizes the energy function:

$$E_1(x_t, y_t, C, A) = \tfrac{1}{2}\|y_t - C x_t\|_2^2 + \lambda \|x_t - A x_{t-1}\|_1 + \gamma \|x_t\|_1 \tag{3}$$

Notice that the second term involving the state-transition is also constrained to be sparse to make
the state-space representation consistent.
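As a rough illustration of the energy in (3), the NumPy sketch below evaluates its three terms for a toy patch. The function name `state_energy`, the toy matrices and the values of `lam` and `gamma` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def state_energy(y, x, x_prev, C, A, lam=0.1, gamma=0.1):
    """Evaluate the state energy: reconstruction + sparse transition + sparse state."""
    recon = 0.5 * np.sum((y - C @ x) ** 2)              # 0.5 * ||y_t - C x_t||_2^2
    transition = lam * np.sum(np.abs(x - A @ x_prev))   # lam * ||x_t - A x_{t-1}||_1
    sparsity = gamma * np.sum(np.abs(x))                # gamma * ||x_t||_1
    return recon + transition + sparsity

# Toy example (assumed values): P = 2 observations, K = 3 states, so K > P.
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
A = np.eye(3)
x_prev = np.array([1.0, 0.0, 0.0])
x = np.array([1.0, 0.0, 0.0])
y = C @ x
energy = state_energy(y, x, x_prev, C, A)
```

In this toy case the reconstruction and transition terms vanish, so only the sparsity penalty on the single active state contributes to the energy.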

Now, to take advantage of the spatial relationships in a local neighborhood, a small group of states
$x_t^{(n)}$, where $n \in \{1, 2, ..., N\}$ represents a set of contiguous patches w.r.t. the position in the image
space, are added (or sum pooled) together. Such pooling of the states may lead to local translation
invariance. On top of this, D-dimensional causes $u_t \in \mathbb{R}^D$ are inferred from the pooled states to
obtain a representation that is invariant to more complex local transformations like rotation, spatial
frequency, etc. In line with [14], this invariant function is learned such that it can capture the
dependencies between the components in the pooled states. Specifically, the causes are inferred
by minimizing the energy function:

$$E_2(u_t, x_t, B) = \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \gamma_k \cdot |x_{t,k}^{(n)}| \right) + \beta \|u_t\|_1, \qquad \gamma_k = \gamma_0 \left[ \frac{1 + \exp(-[B u_t]_k)}{2} \right] \tag{4}$$

where $\gamma_0 > 0$ is some constant. Notice that here $u_t$ multiplicatively interacts with the accumulated
states through $B$, modeling the shape of the sparse prior on the states. Essentially, the invariant
matrix $B$ is adapted such that each component of $u_t$ connects to a group of components in the
accumulated states that co-occur frequently. In other words, whenever a component in $u_t$ is active
it lowers the coefficient of a set of components in $x_t^{(n)}$, $\forall n$, making them more likely to be active.
Since co-occurring components typically share some common statistical regularity, such activity of
$u_t$ typically leads to a locally invariant representation [14].

²Here $y_t$ is a vectorized form of a $\sqrt{P} \times \sqrt{P}$ square patch extracted from a frame at time $t$.
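The modulation described above can be made concrete with a short NumPy sketch of (4). It shows that activating a cause lowers the sparsity weights of exactly those state components it is connected to through $B$. The function names, the toy matrix `B` and the values of `gamma0` and `beta` are assumptions made for illustration only.

```python
import numpy as np

def sparsity_weights(u, B, gamma0=1.0):
    """gamma_k = gamma0 * (1 + exp(-[B u]_k)) / 2 for every state component k."""
    return gamma0 * (1.0 + np.exp(-(B @ u))) / 2.0

def cause_energy(u, X, B, gamma0=1.0, beta=0.1):
    """Energy of the causes; X has shape (N, K), one row of states per patch."""
    pooled = np.sum(np.abs(X), axis=0)   # sum over patches of |x^{(n)}|, shape (K,)
    return np.sum(sparsity_weights(u, B, gamma0) * pooled) + beta * np.sum(np.abs(u))

# Toy setup (assumed): K = 4 state components, D = 2 causes.
# Cause 0 is connected to states 0 and 1, cause 1 to states 2 and 3.
B = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
inactive = sparsity_weights(np.zeros(2), B)          # all weights equal gamma0
active = sparsity_weights(np.array([2.0, 0.0]), B)   # weights of states 0, 1 drop
```

With cause 0 active, the weights on states 0 and 1 fall below `gamma0` while the weights on states 2 and 3 stay unchanged, which is the "more likely to be active" effect described in the text.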

Though the two cost functions are presented separately above, we can combine both to devise a
unified energy function of the form:

$$E(x_t, u_t, \theta) = \sum_{n=1}^{N} \left( \tfrac{1}{2}\|y_t^{(n)} - C x_t^{(n)}\|_2^2 + \lambda \|x_t^{(n)} - A x_{t-1}^{(n)}\|_1 + \sum_{k=1}^{K} |\gamma_{t,k} \cdot x_{t,k}^{(n)}| \right) + \beta \|u_t\|_1 \tag{5}$$

$$\gamma_{t,k} = \gamma_0 \left[ \frac{1 + \exp(-[B u_t]_k)}{2} \right]$$

where $\theta = \{A, B, C\}$. As we will discuss next, both $x_t$ and $u_t$ can be inferred concurrently from
(5) by alternately updating one while keeping the other fixed using an efficient proximal gradient
method.



2.2 Learning 

To learn the parameters in (5), we alternately minimize $E(x_t, u_t, \theta)$ using a procedure similar to
block co-ordinate descent. We first infer the latent variables $(x_t, u_t)$ while keeping the parameters
fixed and then update the parameters $\theta$ while keeping the variables fixed. This is done until the
parameters converge. We now discuss separately the inference procedure and how we update the
parameters using a gradient descent method with the fixed variables.

2.2.1 Inference 

We jointly infer both $x_t$ and $u_t$ from (5) using proximal gradient methods, taking alternating gradient
descent steps to update one while holding the other fixed. In other words, we alternate between
updating $x_t$ and $u_t$ using a single update step to minimize $E_1$ and $E_2$, respectively. However,
updating $x_t$ is relatively more involved. So, keeping aside the causes, we first focus on inferring
sparse states alone from $E_1$, and then go back to discuss the joint inference of both the states and
the causes.

Inferring States: Inferring sparse states, given the parameters, from a linear dynamical system
forms the crux of our model. This is performed by finding the solution that minimizes the energy
function $E_1$ in (3) with respect to the states $x_t$ (while keeping the sparsity parameter $\gamma$ fixed).
Here there are two priors on the states: the temporal dependence and the sparsity term. Although
this energy function $E_1$ is convex in $x_t$, the presence of two non-smooth terms makes it hard to
use standard optimization techniques used for sparse coding alone. A similar problem is solved
using dynamic programming [15], homotopy [16] and Bayesian sparse coding [17]; however, the
optimization used in these models is computationally expensive for use in large scale problems like
object recognition.

To overcome this, inspired by the method proposed in [18] for structured sparsity, we propose an
approximate solution that is consistent and able to use efficient solvers like the fast iterative shrinkage
thresholding algorithm (FISTA) [19]. The key to our approach is to first use Nesterov's smoothness
method [18, 20] to approximate the non-smooth state transition term. The resulting energy function
is a convex and continuously differentiable function in $x_t$ with a sparsity constraint, and hence, can
be efficiently solved using proximal methods like FISTA.

To begin, let $\Omega(x_t) = \|e_t\|_1$ where $e_t = (x_t - A x_{t-1})$. The idea is to find a smooth approximation
to this function $\Omega(x_t)$ in $e_t$. Notice that, since $e_t$ is a linear function of $x_t$, the approximation will
also be smooth w.r.t. $x_t$. Now, we can re-write $\Omega(x_t)$ using the dual norm of $\ell_1$ as

$$\Omega(x_t) = \max_{\|\alpha\|_\infty \le 1} \alpha^T e_t$$

where $\alpha \in \mathbb{R}^K$. Using the smoothing approximation from Nesterov [20] on $\Omega(x_t)$:

$$\Omega(x_t) \approx f_\mu(e_t) = \max_{\|\alpha\|_\infty \le 1} \left[ \alpha^T e_t - \mu\, d(\alpha) \right] \tag{6}$$

where $d(\alpha) = \tfrac{1}{2}\|\alpha\|_2^2$ is a smoothing function and $\mu$ is a smoothness parameter. From Nesterov's
theorem [20], it can be shown that $f_\mu(e_t)$ is convex and continuously differentiable in $e_t$ and the
gradient of $f_\mu(e_t)$ with respect to $e_t$ takes the form

$$\nabla_{e_t} f_\mu(e_t) = \alpha^* \tag{7}$$

where $\alpha^*$ is the maximizer in (6)³. This implies, by using
the chain rule, that $f_\mu(e_t)$ is also convex and continuously differentiable in $x_t$ and with the same
gradient.
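The smoothing step can be checked numerically. The sketch below evaluates the smoothed function in (6) and its gradient in (7), using the fact that the maximizer over the unit $\ell_\infty$ ball is obtained by clipping $e_t / \mu$ to $[-1, 1]$. The function name `smoothed_l1` and the test vector are assumptions chosen for illustration.

```python
import numpy as np

def smoothed_l1(e, mu):
    """Return the smoothed l1 value f_mu(e) and its gradient alpha*."""
    alpha = np.clip(e / mu, -1.0, 1.0)               # maximizer over the l_inf unit ball
    value = alpha @ e - 0.5 * mu * (alpha @ alpha)   # alpha^T e - mu * d(alpha)
    return value, alpha

e = np.array([2.0, -0.5, 0.0, 0.01])
value, alpha = smoothed_l1(e, mu=0.1)

# The smooth surrogate never exceeds ||e||_1 and is off by at most mu * K / 2.
gap = np.sum(np.abs(e)) - value
```

Smaller values of `mu` tighten the approximation to $\|e_t\|_1$ but make the gradient change more abruptly, which in turn forces smaller step sizes in a gradient-based solver.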

With this smoothing approximation, the overall cost function from (3) can now be re-written as

$$x_t = \arg\min_{x_t} \; \tfrac{1}{2}\|y_t - C x_t\|_2^2 + \lambda f_\mu(e_t) + \gamma \|x_t\|_1 \tag{8}$$

with the smooth part $h(x_t) = \tfrac{1}{2}\|y_t - C x_t\|_2^2 + \lambda f_\mu(e_t)$ whose gradient with respect to $x_t$ is given
by

$$\nabla_{x_t} h(x_t) = -C^T (y_t - C x_t) + \lambda \alpha^* \tag{9}$$

Using the gradient information in (9), we solve for $x_t$ from (8) using FISTA [19].
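A minimal sketch of this state inference is given below, assuming a fixed step size in place of a line search. The step is derived from a Lipschitz bound $\|C\|_2^2 + \lambda / \mu$ on the gradient of the smooth part. The function names, the random dictionary and all numeric settings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def infer_states(y, x_prev, C, A, lam=0.1, gamma=0.1, mu=0.01, n_iter=200):
    """Sparse state inference with FISTA on the smoothed energy (fixed step 1/L)."""
    pred = A @ x_prev
    L = np.linalg.norm(C, 2) ** 2 + lam / mu   # Lipschitz bound of the smooth gradient
    x = np.zeros_like(pred)
    z, tau = x.copy(), 1.0
    for _ in range(n_iter):
        alpha = np.clip((z - pred) / mu, -1.0, 1.0)   # gradient of the smoothed l1 term
        grad = -C.T @ (y - C @ z) + lam * alpha       # gradient of the smooth part h
        x_new = soft_threshold(z - grad / L, gamma / L)
        tau_new = (1.0 + np.sqrt(1.0 + 4.0 * tau ** 2)) / 2.0
        z = x_new + ((tau - 1.0) / tau_new) * (x_new - x)   # momentum step
        x, tau = x_new, tau_new
    return x

def objective(x, y, x_prev, C, A, lam=0.1, gamma=0.1):
    """Original (non-smoothed) state energy, used only to inspect the result."""
    return (0.5 * np.sum((y - C @ x) ** 2)
            + lam * np.sum(np.abs(x - A @ x_prev))
            + gamma * np.sum(np.abs(x)))

# Toy problem (assumed): static sparse state observed through a random dictionary.
rng = np.random.default_rng(0)
C = rng.standard_normal((8, 16))
C /= np.linalg.norm(C, axis=0)
A = np.eye(16)
x_prev = np.zeros(16)
x_prev[[1, 5, 9]] = 1.0
y = C @ x_prev
x_hat = infer_states(y, x_prev, C, A)
```

The temporal term pulls the estimate toward `A @ x_prev`, so even with fewer observations than states the solver has a strong hint about which components should be active.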

Inferring Causes: Given a group of state vectors, $u_t$ can be inferred by minimizing $E_2$, where we
define a generative model that modulates the sparsity of the pooled state vector, $\sum_n |x_t^{(n)}|$. Here we
observe that FISTA can be readily applied to infer $u_t$, as the smooth part of the function $E_2$:

$$h(u_t) = \sum_{k=1}^{K} \left( \gamma_0 \left[ \frac{1 + \exp(-[B u_t]_k)}{2} \right] \cdot \sum_{n=1}^{N} |x_{t,k}^{(n)}| \right) \tag{10}$$

is convex, continuously differentiable and Lipschitz in $u_t$ [21]⁴. Following [19], it is easy to obtain
a bound on the convergence rate of the solution.
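The gradient of the smooth part in (10) follows directly from the chain rule: $\nabla h(u_t) = -\tfrac{\gamma_0}{2} B^T (p \odot \exp(-B u_t))$, where $p$ is the pooled state vector. The sketch below implements this gradient together with one proximal step, and compares it against finite differences. The function names and the random test data are assumptions for illustration.

```python
import numpy as np

def smooth_cause_cost(u, pooled, B, gamma0=1.0):
    """Smooth part h(u); pooled[k] is the sum over patches of |x_k^{(n)}|."""
    return np.sum(gamma0 * (1.0 + np.exp(-(B @ u))) / 2.0 * pooled)

def smooth_cause_grad(u, pooled, B, gamma0=1.0):
    """Analytic gradient: -(gamma0 / 2) * B^T (pooled * exp(-B u))."""
    return -0.5 * gamma0 * (B.T @ (pooled * np.exp(-(B @ u))))

def cause_step(u, pooled, B, step, beta=0.1, gamma0=1.0):
    """One proximal-gradient step: gradient step on h, then soft-thresholding."""
    v = u - step * smooth_cause_grad(u, pooled, B, gamma0)
    return np.sign(v) * np.maximum(np.abs(v) - step * beta, 0.0)

# Random non-negative test data (assumed sizes: K = 6 states, D = 3 causes).
rng = np.random.default_rng(0)
B = np.abs(rng.standard_normal((6, 3)))
pooled = np.abs(rng.standard_normal(6))
u = np.abs(rng.standard_normal(3))
grad = smooth_cause_grad(u, pooled, B)
u_next = cause_step(u, pooled, B, step=0.1)
```

With non-negative `B` and non-negative pooled states, every gradient entry is non-positive, so the gradient step always pushes the causes upward and the soft-threshold supplies the opposing sparsity pressure.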

Joint Inference: We showed thus far that both $x_t$ and $u_t$ can be inferred from their respective energy
functions using a first-order proximal method called FISTA. However, for joint inference we have
to minimize the combined energy function in (5) over both $x_t$ and $u_t$. We do this by alternately
updating $x_t$ and $u_t$ while holding the other fixed and using a single FISTA update step at each
iteration. It is important to point out that the internal FISTA step size parameters are maintained
between iterations. This procedure is equivalent to alternating minimization using gradient descent.
Although this procedure no longer guarantees convergence of both $x_t$ and $u_t$ to the optimal solution,
in all of our simulations it led to a reasonably good solution. Please refer to Algorithm 1 (in the
supplementary material) for details. Note that, with the alternating update procedure, each $x_t$ is now
influenced by the feed-forward observations, temporal predictions and the feedback connections
from the causes.

³Please refer to the supplementary material for the exact form of $\alpha^*$.

⁴The matrix $B$ is initialized with non-negative entries and continues to be non-negative without any
additional constraints [21].
2.2.2 Parameter Updates

With $x_t$ and $u_t$ fixed, we update the parameters by minimizing $E$ in (5) with respect to $\theta$. Since the
inputs here are a time-varying sequence, the parameters are updated using dual estimation filtering
[22], i.e., we put an additional constraint on the parameters such that they follow a state space
equation of the form:

$$\theta_t = \theta_{t-1} + z_t \tag{11}$$

where $z_t$ is Gaussian transition noise over the parameters. This keeps track of their temporal
relationships. Along with this constraint, we update the parameters using gradient descent. Notice that
with fixed $x_t$ and $u_t$, each of the parameter matrices can be updated independently. Matrices $C$
and $B$ are column normalized after the update to avoid any trivial solution.
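As a simplified sketch of one such update, the code below takes a plain gradient step on the reconstruction term for the dictionary $C$ and then normalizes its columns. It deliberately leaves out the dual estimation constraint in (11), and the function name, learning rate and random data are assumptions made for illustration.

```python
import numpy as np

def update_dictionary(C, Y, X, lr=0.01):
    """One gradient step on 0.5 * ||Y - C X||_F^2, followed by column normalization.

    Y: (P, T) observations, X: (K, T) inferred states (held fixed during the update).
    """
    grad = -(Y - C @ X) @ X.T                      # gradient w.r.t. C
    C_new = C - lr * grad
    norms = np.linalg.norm(C_new, axis=0, keepdims=True)
    return C_new / np.maximum(norms, 1e-12)        # unit-norm columns avoid trivial scaling

rng = np.random.default_rng(0)
C = rng.standard_normal((8, 16))
X = rng.standard_normal((16, 50))
Y = rng.standard_normal((8, 50))
C_new = update_dictionary(C, Y, X)
```

Without the normalization, the energy could be lowered simply by scaling the columns of `C` up and the states down, which is the trivial solution the text refers to.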

Mini-Batch Update: To get faster convergence, the parameters are updated after performing
inference over a large sequence of inputs instead of at every time instance. With this "batch" of signals,
more sophisticated gradient methods, like conjugate gradient, can be used and, hence, can lead to
more accurate and faster convergence.



2.3 Building a hierarchy 

So far the discussion has focused on encoding a small part of a video frame using a single stage
network. To build a hierarchical model, we use this single stage network as a basic building block
and arrange several of them to form a tree structure (see Figure 1b). To learn this hierarchical model, we
adopt a greedy layer-wise procedure like many other deep learning methods [6, 8, 11]. Specifically,
we use the following strategy to learn the hierarchical model.

For the first (or bottom) layer, we learn a dynamic network as described above over a group of
small patches from a video. We then take this learned network and replicate it at several places
on a larger part of the input frames (similar to weight sharing in a convolutional network [23]).
The outputs (causes) from each of these replicated networks are considered as inputs to the layer
above. Similarly, in the second layer the inputs are again grouped together (depending on the spatial
proximity in the image space) and are used to train another dynamic network. A similar procedure can
be followed to build higher layers.

We again emphasize that the model is learned in a layer-wise manner, i.e., there is no top-down
information while learning the network parameters. Also note that, because of the pooling of the
states at each layer, the receptive field of the causes becomes progressively larger with the depth of
the model.



2.4 Inference with top-down information 

With the parameters fixed, we now shift our focus to inference in the hierarchical model with the
top-down information. As we discussed above, the layers in the hierarchy are arranged in a Markov
chain, i.e., the variables at any layer are only influenced by the variables in the layer below and the
layer above. Specifically, the states $x_t^{(l)}$ and the causes $u_t^{(l)}$ at layer $l$ are inferred from $u_t^{(l-1)}$ and
are influenced by $x_t^{(l+1)}$ (through the prediction term $C^{(l+1)} x_t^{(l+1)}$)⁵. Ideally, to perform inference
in this hierarchical model, all the states and the causes have to be updated simultaneously depending
on the present state of all the other layers until the model reaches equilibrium [10]. However, such
a procedure can be very slow in practice. Instead, we propose an approximate inference procedure
that only requires a single top-down flow of information and then a single bottom-up inference using
this top-down information.



⁵The suffixes $n$ indicating the group are considered implicit here to simplify the notation.
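For this we consider that at any layer $l$ a group of inputs $u_t^{(l-1,n)}$, $\forall n \in \{1, 2, ..., N\}$, are encoded
using a group of states $x_t^{(l,n)}$, $\forall n$, and the causes $u_t^{(l)}$ by minimizing the following energy function:

$$E_l(x_t^{(l)}, u_t^{(l)}, \theta^{(l)}) = \sum_{n=1}^{N} \left( \tfrac{1}{2}\|u_t^{(l-1,n)} - C^{(l)} x_t^{(l,n)}\|_2^2 + \lambda^{(l)} \|x_t^{(l,n)} - A^{(l)} x_{t-1}^{(l,n)}\|_1 + \sum_{k=1}^{K} |\gamma_{t,k}^{(l)} \cdot x_{t,k}^{(l,n)}| \right) + \beta \|u_t^{(l)}\|_1 + \tfrac{1}{2}\|u_t^{(l)} - \hat{u}_t^{(l+1)}\|_2^2 \tag{12}$$

$$\gamma_{t,k}^{(l)} = \gamma_0 \left[ \frac{1 + \exp(-[B^{(l)} u_t^{(l)}]_k)}{2} \right]$$

where $\theta^{(l)} = \{A^{(l)}, B^{(l)}, C^{(l)}\}$. Notice the additional term involving $\hat{u}_t^{(l+1)}$ when compared to (5).
This comes from the top-down information, where we call $\hat{u}_t^{(l+1)}$ the top-down prediction of the
causes of layer $(l)$ using the previous states in layer $(l+1)$. Specifically, before the "arrival" of a
new observation at time $t$, at each layer $(l)$ (starting from the top layer) we first propagate the most
likely causes to the layer below using the state at the previous time instance $x_{t-1}^{(l)}$ and the predicted
causes $\hat{u}_t^{(l+1)}$. More formally, the top-down prediction at layer $l$ is obtained as

$$\hat{u}_t^{(l)} = C^{(l)} \hat{x}_t^{(l)}, \quad \text{where} \quad \hat{x}_t^{(l)} = \arg\min_{x_t^{(l)}} \; \lambda^{(l)} \|x_t^{(l)} - A^{(l)} x_{t-1}^{(l)}\|_1 + \gamma_0 \sum_{k=1}^{K} |\gamma_{t,k} \cdot x_{t,k}^{(l)}| \tag{13}$$

and $\gamma_{t,k} = \exp(-[B^{(l)} \hat{u}_t^{(l+1)}]_k) / 2$.

At the top-most layer, $L$, a "bias" is set such that $\hat{u}_t^{(L+1)} = u_{t-1}^{(L)}$, i.e., the top layer induces some
temporal coherence on the final outputs. From (13), it is easy to show that the predicted states for
layer $l$ can be obtained as

$$\hat{x}_{t,k}^{(l)} = \begin{cases} [A^{(l)} x_{t-1}^{(l)}]_k, & \gamma_0 \gamma_{t,k} < \lambda^{(l)} \\ 0, & \gamma_0 \gamma_{t,k} \ge \lambda^{(l)} \end{cases} \tag{14}$$

These predicted causes $\hat{u}_t^{(l+1)}$, $\forall l \in \{1, 2, ..., L\}$, are substituted in (12) and a single layer-wise
bottom-up inference is performed as described in section 2.2.1⁶. The combined prior now imposed on the
causes, $\beta \|u_t^{(l)}\|_1 + \tfrac{1}{2}\|u_t^{(l)} - \hat{u}_t^{(l+1)}\|_2^2$, is similar to the elastic net prior discussed in [24], leading to
a smoother and biased estimate of the causes.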
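The closed-form prediction in (14) can be sketched in a few lines. Each state component is either propagated through the state-transition matrix or switched off, depending on whether the top-down weight $\gamma_0 \gamma_{t,k}$ falls below $\lambda^{(l)}$. The function name, toy matrices and the values of `lam` and `gamma0` below are assumptions chosen so that the thresholding is easy to follow.

```python
import numpy as np

def predict_states(x_prev, u_top, A, B, lam=0.3, gamma0=1.0):
    """Top-down state prediction: keep [A x_{t-1}]_k only where gamma0 * gamma_k < lam."""
    gamma = np.exp(-(B @ u_top)) / 2.0     # weights set by the predicted causes from above
    propagated = A @ x_prev
    return np.where(gamma0 * gamma < lam, propagated, 0.0)

# Toy layer (assumed): K = 3 states, D = 2 causes; only cause 0 is predicted active.
A = np.eye(3)
B = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
x_prev = np.array([1.0, 2.0, 3.0])
u_top = np.array([1.0, 0.0])
x_hat = predict_states(x_prev, u_top, A, B)
```

Only the state tied to the active cause survives the prediction. The other components have weight 0.5, which is not below `lam`, so they are set to zero regardless of their previous value.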



3 Experiments 

3.1 Receptive fields of causes in the hierarchical model 

Firstly, we would like to test the ability of the proposed model to learn complex features in the
higher layers of the model. For this we train a two-layered network from a natural video. Each
frame in the video was first contrast normalized as described in [13]. Then, we train the first layer
of the model on 4 overlapping contiguous 15 x 15 pixel patches from this video; this layer has
400-dimensional states and 100-dimensional causes. The causes pool the states related to all the
4 patches. The separation between the overlapping patches here was 2 pixels, implying that the
receptive field of the causes in the first layer is 17 x 17 pixels. Similarly, the second layer is trained
on 4 causes from the first layer obtained from 4 overlapping 17 x 17 pixel patches from the video.
The separation between the patches here is 3 pixels, implying that the receptive field of the causes
in the second layer is 20 x 20 pixels. The second layer contains 200-dimensional states and
50-dimensional causes that pool the states related to all the 4 patches.
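The receptive-field sizes quoted above follow from simple arithmetic, sketched below under the assumption that the 4 patches form a 2 x 2 grid, so that each side spans one patch plus one separation. The helper name `receptive_field` and its `patches_per_side` parameter are hypothetical.

```python
def receptive_field(patch_size, separation, patches_per_side=2):
    """Side length covered by a row of overlapping patches offset by `separation` pixels."""
    return patch_size + (patches_per_side - 1) * separation

layer1 = receptive_field(15, 2)        # first-layer causes: 15 + 2
layer2 = receptive_field(layer1, 3)    # second-layer causes: 17 + 3
```

The same rule shows why the receptive field grows with depth: each layer adds its own separation on top of the field of the layer below.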

Figure 2 shows the visualization of the receptive fields of the invariant units (columns of matrix
$B$) at each layer. We observe that each dimension of causes in the first layer represents a group of
primitive features (like inclined lines) which are localized in orientation or position⁷. The
causes in the second layer, in contrast, represent more complex features, like corners, angles, etc. These filters
are consistent with the previously proposed methods like Lee et al. [5] and Zeiler et al. [7].

⁶Note that the additional term $\tfrac{1}{2}\|u_t^{(l)} - \hat{u}_t^{(l+1)}\|_2^2$ in the energy function only leads to a minor modification
in the inference procedure, namely this has to be added to $h(u_t)$ in (10).

Figure 2: Visualization of the receptive fields of the invariant units learned in (a) layer 1 and (b) layer
2 when trained on natural videos. The receptive fields are constructed as a weighted combination of
the dictionary of filters at the bottom layer. Panel titles: (a) Layer 1 invariant matrix, $B^{(1)}$;
(b) Layer 2 invariant matrix, $B^{(2)}$.


3.2 Role of top-down information 

In this section, we show the role of the top-down information during inference, particularly in the
presence of structured noise. Video sequences consisting of objects of three different shapes (refer
to Figure 3) were constructed. The objective is to classify each frame as coming from one of the
three different classes. For this, several 100-frame-long sequences of 32 x 32 pixel frames were made using
two objects of the same shape bouncing off each other and the "walls". Several such sequences were
then concatenated to form a 30,000-frame-long sequence. We train a two-layer network using this sequence.
First, we divided each frame into 12 x 12 patches with neighboring patches overlapping by 4 pixels;
each frame is divided into 16 patches. The bottom layer was trained such that the 12 x 12 patches were
used as inputs and were encoded using a 100-dimensional state vector. Groups of 4 contiguous neighboring
patches were pooled to infer the causes, which have 40 dimensions. The second layer was trained with
4 first-layer causes as inputs, which were themselves inferred from 20 x 20 contiguous overlapping blocks
of the video frames. The states here are 60-dimensional and the causes have only 3 dimensions.
It is important to note here that the receptive field of the second-layer causes encompasses the entire
frame.

We test the performance of the DPCN in two conditions. The first case is with 300 frames of clean
video, with 100 frames per shape, constructed as described above. We consider this as a single video
without considering any discontinuities. In the second case, we corrupt the clean video with
"structured" noise, where we randomly pick a number of objects from the same three shapes with a Poisson
distribution (with mean 1.5) and add them to each frame independently at random locations. There
is no correlation between any two consecutive frames regarding where the "noisy objects" are added
(see Figure 3b).
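The corruption procedure can be sketched as follows. For each frame, a Poisson-distributed number of distractor objects (mean 1.5) is drawn and pasted at independent random locations. The three 5 x 5 binary masks below are hypothetical stand-ins, since the actual shapes and object sizes are not specified in the text, and the function name is likewise an assumption.

```python
import numpy as np

def add_structured_noise(frame, shapes, rng, mean_objects=1.5):
    """Paste a Poisson-distributed number of randomly chosen shapes at random positions."""
    noisy = frame.copy()
    for _ in range(rng.poisson(mean_objects)):
        shape = shapes[rng.integers(len(shapes))]
        h, w = shape.shape
        r = rng.integers(0, frame.shape[0] - h + 1)   # top-left corner, kept inside the frame
        c = rng.integers(0, frame.shape[1] - w + 1)
        noisy[r:r + h, c:c + w] = np.maximum(noisy[r:r + h, c:c + w], shape)
    return noisy

# Hypothetical placeholder shapes (not the shapes used in the paper).
square = np.ones((5, 5))
cross = np.zeros((5, 5))
cross[2, :] = 1.0
cross[:, 2] = 1.0
diagonal = np.eye(5)

rng = np.random.default_rng(0)
frame = np.zeros((32, 32))
frame[10:15, 10:15] = square                          # the "true" object in this frame
noisy = add_structured_noise(frame, [square, cross, diagonal], rng)
```

Calling the function once per frame with a shared generator gives noise that is independent across frames, matching the description that consecutive frames have uncorrelated distractor positions.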

First we consider the clean video and perform inference with only bottom-up inference, i.e., during
inference we consider $\hat{u}_t^{(l+1)} = 0$, $\forall l \in \{1, 2\}$. Figure 4a shows the scatter plot of the
three-dimensional causes at the top layer. Clearly, there are 3 clusters recognizing the three different shapes in the
video sequence. Figure 4b shows the scatter plot when the same procedure is applied on the noisy
video. We observe that the 3 shapes here cannot be clearly distinguished. Finally, we use top-down
information along with the bottom-up inference as described in section 2.4 on the noisy data. We
argue that, since the second layer learned class specific information, the top-down information can
help the bottom layer units to disambiguate the noisy objects from the true objects. Figure 4c shows
the scatter plot for this case. Clearly, with the top-down information, in spite of the largely corrupted
sequence, the DPCN is able to separate the frames belonging to the three shapes (the trace from one
cluster to the other is because of the temporal coherence imposed on the causes at the top layer).



⁷Please refer to the supplementary material for more results.

Figure 3: Shows part of the (a) clean and (b) corrupted video sequences constructed using three
different shapes. Each row indicates one sequence. Panel titles: (a) Clear Sequences;
(b) Corrupted Sequences.

Figure 4: Shows the scatter plot of the 3 dimensional causes at the top layer for (a) clean video
with only bottom-up inference, (b) corrupted video with only bottom-up inference and (c) corrupted
video with top-down flow along with bottom-up inference. At each point, the shape of the marker
indicates the true shape of the object in the frame.

4 Conclusion 

In this paper we proposed the deep predictive coding network, a generative model that empirically
alters the priors in a dynamic and context-sensitive manner. This model is composed of two main
components: (a) linear dynamical models with sparse states used for feature extraction, and (b) top-down
information to adapt the empirical priors. The dynamic model captures the temporal dependencies
and reduces the instability usually associated with sparse coding [25], while the task specific
information from the top layers helps to resolve ambiguities in the lower layers, improving data representation
in the presence of noise. We believe that our approach can be extended with convolutional methods,
paving the way for implementation of high-level tasks like object recognition on large scale
videos or images.
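A Supplementary material for Deep Predictive Coding Networks

A.1 From section 2.2.1, computing $\alpha^*$

The optimal solution of $\alpha$ in (6) is given by

$$\begin{aligned} \alpha^* &= \arg\max_{\|\alpha\|_\infty \le 1} \left[ \alpha^T e_t - \frac{\mu}{2} \|\alpha\|_2^2 \right] \\ &= \arg\min_{\|\alpha\|_\infty \le 1} \left\| \alpha - \frac{e_t}{\mu} \right\|_2^2 \\ &= S\!\left( \frac{e_t}{\mu} \right) \end{aligned}$$

where $S(\cdot)$ is a function projecting onto an $\ell_\infty$-ball. This is of the form:

$$S(x) = \begin{cases} x, & -1 \le x \le 1 \\ 1, & x > 1 \\ -1, & x < -1 \end{cases} \tag{15}$$
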

A.2 Algorithm for joint inference of the states and the causes.

Algorithm 1 Updating $x_t$, $u_t$ simultaneously using a FISTA-like procedure [19].
Require: Take $L_0^{x_n} > 0$ $\forall n \in \{1, 2, ..., N\}$, $L_0^{u} > 0$ and some $\eta > 1$.

 1: Initialize $x_{0,n} \in \mathbb{R}^K$ $\forall n \in \{1, 2, ..., N\}$, $u_0 \in \mathbb{R}^D$ and set $\xi_1 = u_0$, $z_{1,n} = x_{0,n}$.
 2: Set step-size parameters: $\tau_1 = 1$.
 3: while no convergence do
 4:   Update $\gamma = \gamma_0 (1 + \exp(-[B u_i])) / 2$.
 5:   for $n \in \{1, 2, ..., N\}$ do
 6:     Line search: Find the best step size $L_i^{x_n}$.
 7:     Compute $\alpha^*$ from (15).
 8:     Update $x_{i,n}$ using the gradient from (9) with a soft-thresholding function.
 9:     Update internal variables $z_{i+1}$ with step size parameter $\tau_i$ as in [19].
10:   end for
11:   Compute $\sum_{n=1}^{N} |x_{i,n}|$.
12:   Line search: Find the best step size $L_i^{u}$.
13:   Update $u_i$ using the gradient of (10) with a soft-thresholding function.
14:   Update internal variables $\xi_{i+1}$ with step size parameter $\tau_i$ as in [19].
15:   Update $\tau_{i+1} = \left(1 + \sqrt{4\tau_i^2 + 1}\right) / 2$.
16:   Check for convergence.
17:   $i = i + 1$
18: end while
19: return $x_{i,n}$ $\forall n \in \{1, 2, ..., N\}$ and $u_i$
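The structure of Algorithm 1 can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the line searches in steps 6 and 12 are replaced by fixed steps derived from rough Lipschitz bounds, a fixed iteration count stands in for the convergence check, and all names, sizes and parameter values are hypothetical.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding, the proximal operator of a weighted l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def joint_inference(Y, X_prev, A, B, C, lam=0.1, beta=0.1, gamma0=0.1,
                    mu=0.01, n_iter=100):
    """Alternate single FISTA-style steps on the states X (N, K) and the causes u (D,).

    Y: (N, P) patches in a group, X_prev: (N, K) states at the previous time step.
    """
    N, K = X_prev.shape
    D = B.shape[1]
    pred = X_prev @ A.T                                   # rows are A x_{t-1}^{(n)}
    L_x = np.linalg.norm(C, 2) ** 2 + lam / mu            # fixed step instead of line search
    B_norm_sq = np.linalg.norm(B, 2) ** 2
    X, u = np.zeros((N, K)), np.zeros(D)
    Z, xi, tau = X.copy(), u.copy(), 1.0
    for _ in range(n_iter):
        gamma = gamma0 * (1.0 + np.exp(-(B @ u))) / 2.0   # step 4: sparsity weights
        alpha = np.clip((Z - pred) / mu, -1.0, 1.0)       # step 7: projection S(e / mu)
        grad_x = -(Y - Z @ C.T) @ C + lam * alpha         # gradient of the smooth state cost
        X_new = soft(Z - grad_x / L_x, gamma / L_x)       # step 8
        pooled = np.abs(X_new).sum(axis=0)                # step 11: pooled states
        L_u = max(0.5 * gamma0 * B_norm_sq * pooled.max(), 1e-8)
        grad_u = -0.5 * gamma0 * (B.T @ (pooled * np.exp(-(B @ xi))))
        u_new = soft(xi - grad_u / L_u, beta / L_u)       # step 13
        tau_new = (1.0 + np.sqrt(1.0 + 4.0 * tau ** 2)) / 2.0   # step 15
        coef = (tau - 1.0) / tau_new
        Z = X_new + coef * (X_new - X)                    # step 9
        xi = u_new + coef * (u_new - u)                   # step 14
        X, u, tau = X_new, u_new, tau_new
    return X, u

# Toy problem (assumed sizes): N = 4 patches, P = 8, K = 16 states, D = 3 causes.
rng = np.random.default_rng(0)
C = rng.standard_normal((8, 16))
C /= np.linalg.norm(C, axis=0)
A = np.eye(16)
B = np.abs(rng.standard_normal((16, 3)))
X_true = np.zeros((4, 16))
X_true[:, [2, 7]] = 1.0
Y = X_true @ C.T
X_hat, u_hat = joint_inference(Y, X_true, A, B, C)
```

Note that the momentum parameter `tau` is shared by the state and cause updates and carried across iterations, mirroring the remark in the main text that the internal FISTA parameters are maintained between iterations.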
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A.3 Inferring sparse states with known parameters 



3 




Observation Dimensions 

Figure 5: Performance of the inference algorithm with fixed parameters compared 
with sparse coding and Kalman filtering. We first simulate a state sequence with only 20 
non-zero elements in a 500-dimensional state vector, evolving with a permutation matrix that is 
different at every time instant, followed by a scaling matrix to generate a sequence of observations. 
Both the permutation and the scaling matrices are assumed to be known a priori. The observation 
noise is zero-mean Gaussian with variance $\sigma^2 = 0.01$. We consider sparse state-transition noise, 
which is simulated by choosing a subset of active elements in the state vector (the number of elements 
is drawn from a Poisson distribution with mean 2) and switching each of them with a 
randomly chosen element (with uniform probability over the state vector). This resembles a sparse 
innovation in the states. We use the generated observation sequences as inputs and use the a priori 
known parameters to infer the states from the dynamic model. Figure 5 shows the results obtained, 
where we compare the inferred states from different methods with the true states in terms of relative 
mean squared error (rMSE), defined as $\|x_t^{\mathrm{est}} - x_t^{\mathrm{true}}\| / \|x_t^{\mathrm{true}}\|$. The steady-state error (rMSE) 
after 50 time instances is plotted against the dimensionality of the observation sequence. Each 
point is obtained after averaging over 50 runs. We observe that our model is able to converge to 
the true solution even for low-dimensional observations, where other methods like sparse coding fail. 
We argue that the temporal dependencies considered in our model are able to drive the solution to the 
right attractor basin, insulating it from instabilities typically associated with sparse coding ||2^ . 
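
The simulation protocol described in this caption can be sketched as follows. This is a hedged illustration in Python, not the code used to produce the figure. The function names, the Gaussian amplitudes of the initial active elements, the use of the Euclidean norm in `rmse`, and the generic observation matrix `C` passed to `observe` are all assumptions made for the example.

```python
import math
import random


def poisson(rng, mean):
    """Draw from a Poisson distribution with Knuth's method (small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_states(n_steps, dim=500, n_active=20, innovation_mean=2.0, seed=0):
    """Sparse state sequence driven by random permutations and sparse swaps."""
    rng = random.Random(seed)
    x = [0.0] * dim
    for idx in rng.sample(range(dim), n_active):
        x[idx] = rng.gauss(0.0, 1.0)      # assumed amplitude distribution
    states = []
    for _ in range(n_steps):
        perm = list(range(dim))
        rng.shuffle(perm)                 # a new permutation at every instant
        x = [x[perm[k]] for k in range(dim)]
        # Sparse innovation: swap a few active entries with random positions.
        active = [k for k, v in enumerate(x) if v != 0.0]
        n_switch = min(poisson(rng, innovation_mean), len(active))
        for k in rng.sample(active, n_switch):
            j = rng.randrange(dim)        # uniform over the state vector
            x[k], x[j] = x[j], x[k]
        states.append(list(x))
    return states


def observe(C, x, rng, noise_std=0.1):
    """y = C x + Gaussian noise; noise_std = 0.1 gives variance 0.01."""
    return [sum(c * v for c, v in zip(row, x)) + rng.gauss(0.0, noise_std)
            for row in C]


def rmse(x_est, x_true):
    """Relative error ||x_est - x_true|| / ||x_true|| (Euclidean norm assumed)."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_est, x_true)))
    den = math.sqrt(sum(b ** 2 for b in x_true))
    return num / den
```

Because both the permutation and the swaps only move values around, the number of non-zero entries in every simulated state stays equal to `n_active`, which is a convenient invariant to check before feeding the observations to an inference routine.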
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A.4 Visualizing first layer of the learned model 



[Figure 6 panels: (a) Observation matrix (Bases); (b) State-transition matrix, with columns "Active state element at (t-1)" and "Corresponding predicted states at (t)"] 

Figure 6: Visualization of the parameters, C and A, of the model described in section 3.1. (A) 
Shows the learned observation matrix C. Each square block indicates a column of the matrix, 
reshaped as a two-dimensional pixel block. (B) Shows the state-transition matrix A using its connection 
strengths with the observation matrix C. On the left are the bases corresponding to the single active 
element in the state at time (t - 1), and on the right are the bases corresponding to the five most 
"active" elements in the predicted state (in decreasing order of magnitude). 
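
One way to select the bases shown in panel (B) is sketched below in Python. It assumes that the predicted state for a single active element j at time (t - 1) is column j of A, and that A is stored as a list of rows. The function name `top_predicted` is hypothetical.

```python
def top_predicted(A, j, k=5):
    """Indices of the k largest-magnitude entries in column j of A."""
    column = [row[j] for row in A]
    order = sorted(range(len(column)), key=lambda i: abs(column[i]), reverse=True)
    return order[:k]
```

The returned indices identify the columns of C to display next to basis j, already sorted in decreasing order of magnitude.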




[Figure 7 panels: (a) Connections; (b) Centers and Orientations; (c) Orientations and Frequencies] 



Figure 7: Connections between the invariant units and the basis functions. (A) Shows the 
connections between the bases and the columns of B. Each row indicates an invariant unit. Here the set of 
bases that are strongly correlated to an invariant unit is shown, arranged in decreasing order of 
magnitude. (B) Shows the spatially localized grouping of the invariant units. First, we fit a Gabor 
function to each of the basis functions. Each subplot is then obtained by plotting a line 
indicating the center and the orientation of the Gabor function. The colors indicate the connection strength 
with an invariant unit, with red indicating stronger connections and blue indicating almost zero strength. 
We randomly select a subset of 25 invariant units here. We observe that the invariant units group 
the bases that are local in spatial centers and orientations. (C) Similarly, we show the 
corresponding orientation and spatial frequency selectivity of the invariant units. Here each plot indicates the 
orientation and frequency of each Gabor function, color coded according to the connection strengths 
with the invariant units. Each subplot is a half-polar plot with the orientation plotted along the angle, 
ranging from 0 to $\pi$, and the distance from the center indicating the frequency. Again, we observe 
that the invariant units group the bases that have similar orientation. 
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